Basic theory of Feed-forward Neural Networks
Claudio Mirabello
Resources
- MIT lectures on Deep Learning (http://introtodeeplearning.com/)
- TensorFlow Playground (https://playground.tensorflow.org)
- Keras Docs (https://keras.io)
This is a neuron
(image: wikipedia.org)
This is a perceptron (1958)
Mark I Perceptron machine
(images: wikipedia.org; MIT "Intro to Deep Learning")
Perceptrons caused excitement
"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
(The New York Times)
Perceptrons can only learn linearly separable classes
(Minsky and Papert, 1969)
Perceptrons can only learn linearly separable classes
(TensorFlow Playground)
But sometimes you want to model non-linear functions
How do we make this non-linear then?
Two ingredients to add
1: Differentiable, non-linear activation functions
Common activation functions
Special case: softmax
● Used in classification problems
● Given k classes, it decides which one is more likely
● One output per class, each output is assigned a probability from 0 to 1
● The sum of probabilities for all outputs is 1
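The softmax properties listed above can be sketched in a few lines of plain Python (the function name and example scores are illustrative, not from the slides):

```python
import math

def softmax(scores):
    """Turn k real-valued scores into k probabilities that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# one output per class; the largest score gets the largest probability
probs = softmax([2.0, 1.0, 0.1])
```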
Wait a second, the perceptron already has a non-linear (step) activation function!
This is a perceptron
(But the step function is not differentiable, so it cannot be trained with gradients.)
2: Multi-layer Perceptron (1986)
Now we're getting somewhere
Why stop at one hidden layer?
Deep Networks are simply NNs with multiple hidden layers
Deeper this way
(https://playground.tensorflow.org)
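A multi-layer perceptron is just weighted sums interleaved with non-linear activations, layer after layer. A minimal sketch in plain Python (the layer sizes and weight values here are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """One fully connected layer: a weighted sum per unit, then a sigmoid."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, ws))) for ws in weights]

# 2 inputs -> 2 hidden units -> 1 output (weights chosen arbitrarily)
x = [0.50, 0.51]
hidden = layer(x, [[0.35, 0.40], [0.20, -0.10]])
y_hat = layer(hidden, [[0.60, -0.30]])[0]
```

Stacking more calls to `layer` is all it takes to make the network deeper.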
Let's review:
- Perceptron
- XOR problem
- Activations
- Multi-layer perceptron
How do we decide which weights are optimal?
● A linear regressor's weights (coefficients) are calculated in closed form
● This can't be done if you have hidden layers and non-linear activations
Lower loss => better predictions
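"Lower loss => better predictions" can be made concrete with the squared-error loss that the worked example uses later (a sketch; the factor ½ is there only to simplify the derivative):

```python
def squared_error(y, y_hat):
    """L(y, y_hat) = 1/2 * (y - y_hat)^2: zero for a perfect prediction."""
    return 0.5 * (y - y_hat) ** 2

close = squared_error(0.10, 0.15)   # close prediction -> small loss
far = squared_error(0.10, 0.90)     # far-off prediction -> large loss
```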
Minimizing loss
Gradient descent
Activation functions have to be differentiable!
The learning rate η
Backpropagation
Backpropagation example, step by step:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
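Gradient descent with learning rate η can be sketched on a one-dimensional loss, say J(w) = (w - 3)², whose gradient is 2(w - 3) (the target value 3, the starting point, and η = 0.1 are all arbitrary choices for illustration):

```python
def grad(w):
    """Gradient of J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

eta = 0.1          # learning rate
w = 0.0            # arbitrary starting point
for _ in range(100):
    w = w - eta * grad(w)   # step against the gradient
# w has moved very close to the minimum at w = 3
```

A smaller η takes more iterations; a too-large η can overshoot and diverge.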
Practical example
Simple network:
- Two inputs [x1, x2]
- Two weights [w1, w2]
- No bias
- Activation function g( )
- One output ŷ
- One label y
- Loss function 𝓛( )
- Weight-dependent error J(W)
(diagram: x1, x2 → [w1, w2] → Σ → g → ŷ; 𝓛(y, ŷ) → J)
1. Forward pass
x1 = 0.50, x2 = 0.51
w1 = 0.35, w2 = 0.40
Σ = ?
ŷ = g(Σ) = 1/(1 + e^-Σ) = ?
1. Forward pass
x1 = 0.50, x2 = 0.51
w1 = 0.35, w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.35 * 0.50 + 0.40 * 0.51 = 0.38
ŷ = g(Σ) = 1/(1 + e^-Σ) = 0.59
2. Calculate Loss
x1 = 0.50, x2 = 0.51
w1 = 0.35, w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.38
ŷ = g(Σ) = 1/(1 + e^-Σ) = 0.59
y = 0.10
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)² = ?
2. Calculate Loss
x1 = 0.50, x2 = 0.51
w1 = 0.35, w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.38
ŷ = g(Σ) = 1/(1 + e^-Σ) = 0.59
y = 0.10
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)² = 0.12
3. Backpropagate the error
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ * ∂ŷ/∂Σ * ∂Σ/∂w1
(loss) (activation) (weight)
Derivatives are calculated in this order
3. Backpropagate the error: loss
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ * ∂ŷ/∂Σ * ∂Σ/∂w1
(loss) (activation) (weight)
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)²
∂J(W)/∂ŷ = ∂𝓛(y, ŷ)/∂ŷ = ?
3. Backpropagate the error: loss
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ * ∂ŷ/∂Σ * ∂Σ/∂w1
(loss) (activation) (weight)
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)²
∂J(W)/∂ŷ = 2 * ½ * (y – ŷ) * (-1) = ŷ – y = 0.59 – 0.10 = 0.49
3. Backpropagate the error: activation
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * ∂ŷ/∂Σ * ∂Σ/∂w1
(loss) (activation) (weight)
ŷ = g(Σ) = 1/(1 + e^-Σ)
∂ŷ/∂Σ = ?
3. Backpropagate the error: activation
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * ∂ŷ/∂Σ * ∂Σ/∂w1
(loss) (activation) (weight)
ŷ = g(Σ) = 1/(1 + e^-Σ)
∂ŷ/∂Σ = 1/(1 + e^-Σ) * (1 – 1/(1 + e^-Σ)) = 0.59 * (1 – 0.59) = 0.24
3. Backpropagate the error: weight
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * 0.24 * ∂Σ/∂w1
(loss) (activation) (weight)
Σ = X * W = x1 * w1 + x2 * w2
∂Σ/∂w1 = ?
3. Backpropagate the error: weight
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * 0.24 * ∂Σ/∂w1
(loss) (activation) (weight)
Σ = X * W = x1 * w1 + x2 * w2
∂Σ/∂w1 = x1 + 0 = 0.50
4. Weight update
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * 0.24 * 0.50 = 0.06 (gradient)
(loss) (activation) (weight)
w1' = ?
4. Weight update
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * 0.24 * 0.50 = 0.06 (gradient)
(loss) (activation) (weight)
w1' = w1 – η*0.06 = 0.35 – 0.06 = 0.29 (with η = 1)
Exercise
Can you calculate the weight update for w2? How many new gradients do you need to calculate?
What is the new predicted output? Has the error gone down?
What if I had another layer before this one?
(diagram, after the update: w1 = 0.29, w2 = 0.34, Σ = 0.32, ŷ = 0.58)
4. Weight update
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = 0.49 * 0.24 * 0.50 = 0.06 (gradient)
(loss) (activation) (weight)
w1' = w1 – η*0.06 = 0.35 – 0.06 = 0.29
w2' = w2 – η*0.06 = 0.40 – 0.06 = 0.34
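The whole worked example (forward pass, loss, backpropagation, weight update with η = 1) can be checked with a short script; the numbers match the slides after rounding:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs, weights and label from the slides
x1, x2 = 0.50, 0.51
w1, w2 = 0.35, 0.40
y = 0.10
eta = 1.0                           # learning rate used on the slides

# 1. Forward pass
sigma = w1 * x1 + w2 * x2           # ~0.38
y_hat = sigmoid(sigma)              # ~0.59

# 2. Loss
J = 0.5 * (y - y_hat) ** 2          # ~0.12

# 3. Backpropagate: chain rule, one factor per stage
dJ_dyhat = y_hat - y                # ~0.49 (loss)
dyhat_dsigma = y_hat * (1 - y_hat)  # ~0.24 (activation)
dsigma_dw1 = x1                     # 0.50  (weight)
dsigma_dw2 = x2                     # 0.51

grad_w1 = dJ_dyhat * dyhat_dsigma * dsigma_dw1   # ~0.06
grad_w2 = dJ_dyhat * dyhat_dsigma * dsigma_dw2   # ~0.06

# 4. Weight update
w1_new = w1 - eta * grad_w1         # ~0.29
w2_new = w2 - eta * grad_w2         # ~0.34
```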
Gradient vanishing
∂J/∂W1 = ∂J/∂Ŷ * ∂Ŷ/∂ΣN * ∂ΣN/∂WN * ... * ∂Zk/∂Σk * ∂Σk/∂Wk * ... * ∂Z1/∂Σ1 * ∂Σ1/∂W1
What happens if we backpropagate on a network with many (N > k > 1) hidden layers?
Gradient vanishing
∂J/∂W1 = ∂J/∂Ŷ * ∂Ŷ/∂ΣN * ∂ΣN/∂WN * ... * ∂Zk/∂Σk * ∂Σk/∂Wk * ... * ∂Z1/∂Σ1 * ∂Σ1/∂W1 = O(10^-N)
initial w1 = 0.5
optimal w1 = -0.2
5-layer gradient ~ 0.00001
How many iterations do we need to get from 0.5 to -0.2?
These are all "zero-point-somethings" multiplied by each other, so the gradient becomes smaller by orders of magnitude as we go back more and more layers, until it's so small that the network is stuck.
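The "zero-point-somethings" argument can be demonstrated numerically: the sigmoid derivative g(z)*(1 - g(z)) is at most 0.25, so a chain of one such factor per layer shrinks geometrically (the depth of 10 and the activation value 0.5 are illustrative choices):

```python
def sigmoid_derivative(a):
    """Sigmoid derivative expressed via its output a = g(z): a * (1 - a)."""
    return a * (1.0 - a)

# Suppose every unit along the path outputs a = 0.5, where the derivative
# peaks: 0.25 is the LARGEST this factor can ever be.
factor = sigmoid_derivative(0.5)

gradient = 1.0
for layer in range(10):      # 10 hidden layers deep
    gradient *= factor       # one activation-derivative factor per layer
# gradient is now 0.25**10, under a millionth: far too small
# to move the weights of the early layers
```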